CATD: a reproducible pipeline for selecting cell-type deconvolution methods across tissues

Abstract Motivation Cell-type deconvolution methods aim to infer cell composition from bulk transcriptomic data. The proliferation of developed methods coupled with inconsistent results obtained in many cases, highlights the pressing need for guidance in the selection of appropriate methods. Additionally, the growing accessibility of single-cell RNA sequencing datasets, often accompanied by bulk expression from related samples enable the benchmark of existing methods. Results In this study, we conduct a comprehensive assessment of 31 methods, utilizing single-cell RNA-sequencing data from diverse human and mouse tissues. Employing various simulation scenarios, we reveal the efficacy of regression-based deconvolution methods, highlighting their sensitivity to reference choices. We investigate the impact of bulk-reference differences, incorporating variables such as sample, study and technology. We provide validation using a gold standard dataset from mononuclear cells and suggest a consensus prediction of proportions when ground truth is not available. We validated the consensus method on data from the stomach and studied its spillover effect. Importantly, we propose the use of the critical assessment of transcriptomic deconvolution (CATD) pipeline which encompasses functionalities for generating references and pseudo-bulks and running implemented deconvolution methods. CATD streamlines simultaneous deconvolution of numerous bulk samples, providing a practical solution for speeding up the evaluation of newly developed methods. Availability and implementation https://github.com/Papatheodorou-Group/CATD_snakemake.


Glossary of key terms
Cross-reference deconvolution: Supervised deconvolution task where the reference and the pseudo-bulk come from different samples, datasets, or technologies.

Deconvolution :
Cell type deconvolution is a computational technique that aims to infer the relative proportions or abundances of different cell types within a heterogeneous biological sample, such as a tissue or a mixture of cells.This approach leverages gene expression or other molecular data (such as methylation) from a sample to estimate the contributions of individual cell types that make up the complex mixture.
Self-reference deconvolution: Supervised deconvolution task where the reference and the pseudo-bulk come from the same dataset after splitting it into training (reference) and testing (pseudo-bulk).

Snakemake workflow manager:
Snakemake is a workflow management system and domain-specific language used that automates and manages the execution of complex, multi-step data analysis pipelines.It allows users to define the steps, inputs, outputs, and dependencies of a workflow in a human-readable format.Then it automatically manages the execution of these steps, ensuring that they are executed in the correct order while taking advantage of available computational resources efficiently.This tool is particularly valuable in bioinformatics research where reproducibility, scalability, and ease of use are essential.10X : 10x Genomics' single cell RNA-seq (scRNA-seq) technology, the Chromium Single Cell 3' solution, allows you to analyse transcriptomes on a cell-by-cell basis through the use of microfluidic partitioning to capture single cells and prepare barcoded, next-generation sequencing (NGS) cDNA libraries.
Smart-Seq2(SS2): Smart-seq2 is a single-cell RNA sequencing (scRNA-seq) technology that enables comprehensive transcriptome analysis at the single-cell level.It is characterised by its full-length cDNA synthesis approach, providing high-resolution insights into the gene expression profiles of individual cells.

Supplementary Methods
Explanation of the 3 streams of CATD pipeline 1. Self-reference deconvolution pipeline.Description: For this task we input a single-cell dataset in Seurat or Anndata object format.The dataset is split in two, the training dataset that will be used as a reference and a test dataset that will be used to generate simulated bulk RNA-seq data.
The first step of the pipeline includes converting the data from anndata format to seurat object using the sceasy package.
Next, the dataset is split in half, the first half will be used for reference generation and the second half for pseudo-bulk generation.In this step the seed can be specified for reproducibility.We also make sure that all the cell types in the single-cell dataset have been included in both halves.The next step of the pipeline happens in parallel and includes the generation of suitable references -that will be later used as inputs in deconvolution methods-and simulated bulk data(see details in the methods below).Once the references and pseudo-bulk files have been generated normalisation and transformation of pseudo-bulk and reference matrices is performed where it is relevant.Next, the relevant generated inputs are fed in the deconvolution methods and cell-type predictions are obtained.The first output of the pipeline is a matrix of cell-type proportions per sample for each deconvolution method which is saved in RDS format under the directory 'Output' .
Since in self-reference deconvolution ground truth proportions are known (determined while generating pseudo-bulk profiles) evaluation metrics are calculated across samples for each method.The output of the evaluation is a RDS-formatted file for each metric that includes a list of values of this metric for each deconvolution method.These files are stored under the directory 'Metrics'.Finally, an evaluation of the scalability is performed for each method, the table of values is stored under 'Benchmarks'.The values from the metrics and benchmarks file are used for visualisation of the performance and comparison of scalability across methods.Violin plots and bar charts are used to visualise the above.Resulting plots are then stored under the directory 'Plots'.(Supplementary Figure 11) 2. Cross-reference deconvolution pipeline.
Description: The main steps of the cross-reference pipeline are the same as described in the previous section.However, in this task we need two input single-cell datasets instead of one.The first dataset will be used for the generation of pseudo-bulk(and the ground truth) and the second for the generation of the references.For this task it is required that both datasets originate from the same tissue and hence, they include the same cell-types.In addition, the cell-type labels should be matched to allow evaluation of the results.The primary objective of the cross-reference pipeline is to facilitate the implementation of controlled in silico experiments,through which we can investigate how differences in gene expression between (pseudo)bulk and reference , stemming from sample differences, technologies, or studies, can influence the deconvolution process.Unlike self-reference deconvolution this pipeline introduces more noise and simulates a more realistic deconvolution scenario.(SupplementaryFigure 12)

Real bulk deconvolution pipeline.
Description: This part of the pipeline was designed for the real bulk deconvolution as well as for future users to deconvolute their own data across the implemented methods.This pipeline is split into two smaller pipelines depending on the presence or absence of ground truth proportions.When ground truth data are present(indicated in the config.yamlfile as realBulk: 1) the users need to provide (a) bulk expression matrix to be deconvolved (b) a matrix of known ground truth proportions (c) single-cell data in anndata or Seurat object format.Moreover the users can decide regarding the preprocessing parameters(normalisation transformation of both the bulk and single-cell matrix), which statistical test will be used for marker selection, which deconvolution methods will run, as well as if a consensus of EpiDISH,FARDEEP and DWLS will be calculated.The outputs of the pipeline will be the same as described in the self-reference deconvolution section.In cases where there is no knowledge regarding the ground truth proportions (parameter realBulk-noprop : 1 ) users can still run the pipeline by providing only the bulk and the single-cell reference of their choice.Since in this case evaluation metrics cannot be calculated for each method instead we perform correlation of results across methods to inform the user regarding the concordance of methods in predicting proportions.(SupplementaryFigure 13)

Implementation of deconvolution methods
Here is the list of deconvolution methods currently implemented in the pipeline ordered alphabetically with their package version numbers.

Implementation of performance evaluation metrics
The pipeline can evaluate the predictions of deconvolution methods through various metrics when ground truth proportions are available(from pseudo-bulk or real data).The metrics that have been implemented in the pipeline are listed below: Where is the mean of the i-th row of the matrix X and is the mean of the i=th row of matrix   w i is defined as the inverse of the ground truth proportion for the i-th cell type Where p i is the ground truth proportion for the i-th cell type.
This modification of RMSE ensures that rare cell types, characterised by lower ground truth proportions, contribute more significantly to the weighted RMSE.
All the metrics have been estimated in R and the functions can be found here https://github.com/Functional-Genomics/CATD_snakemake/blob/main/Modules/Res_explore/Res_explore.R.For each metric, the scores of the methods are saved as named list objects throughout the pipeline.Graphs are also constructed from these lists for initial viewing of the results. Abbreviations Let: be the ground truth matrix, where N is the number of cell types and M is the number of samples  ∈   be the predicted matrix  ∈   • N is the number of cell types • M is the number of samples • represents the element in the -th row in the -th column of matrix     • represents the element in the -th row in the -th column of matrix     1. Root Mean Squared error(RMSE)/Root Mean Squared Deviation(RMSD)

|
) -R( ) is the difference between the ranks for the element in the i-th row and the j-th column for each  correlation Let be a vector of values from ground truth matrix X and be a vector of values from predicted matrix  ,) (,)dCov & dCor = Distance covariance & correlation as described in this study 50Note that dCov(X,X) = dVar(x), which is the distance variance.n = number of observations(number of values in each vector and  D( , = centered Euclidean distances between the i_th and j_th observations of xFor p = 2 it is simplified to Euclidean Distance which is the metric used in the CATD pipeline.

9 .
Weighted RMSELet w i be the weight assigned to the i-th cell type based on its ground truth proportion.The weighted RMSE(wRMSE) can then be calculated as: